Clustering and KNN

Taotao Tan

Calculating Euclidean distance

Euclidean distance is simply the geometric distance between the two points. Recall the basic Pythagorean formula:

$$a^2 + b^2 = c^2$$

where $a$ and $b$ are the legs of a right triangle and $c$ is the hypotenuse. Here $a^2$ is the squared difference between the two points in the first dimension, and $b^2$ is the squared difference in the second dimension. To get the Euclidean distance $c$, we take $\sqrt{c^2}$, or simply $\sqrt{a^2 + b^2}$.

To express Euclidean distance in more dimensions, we use the same formula and keep adding terms, e.g. $a^2 + b^2 + c^2 = d^2$. It now becomes convenient to change our notation a bit. Say we have $n$ dimensions, and call them $D_1, D_2, \ldots, D_n$. For any two points $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$ in $n$-dimensional space, we can write the Euclidean distance as:

$$d_{xy} = \sqrt{\sum_{i= 1}^n{(y_i - x_i)^2}}$$
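This formula is easy to check numerically. A minimal sketch using NumPy (the two points here are made up):

```python
import numpy as np

# two made-up points in 3-dimensional space
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

# sum the squared differences in each dimension, then take the square root
d = np.sqrt(np.sum((y - x) ** 2))
print(d)  # 5.0, since 3^2 + 4^2 + 0^2 = 25

# np.linalg.norm of the difference vector computes the same quantity
print(np.linalg.norm(y - x))  # 5.0
```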

Pairwise distances between genes

To find the Euclidean distance between GeneA and GeneB, let’s refer to the value of GeneA in Sample1 as $A_1$ and the value of GeneA in Sample2 as $A_2$; let’s call the value of GeneB in Sample1 $B_1$ and in Sample2 $B_2$; and so on.

The Euclidean distance between GeneA and GeneB can now be calculated using the following formula:

$$d_{AB} = \sqrt{(B_1 - A_1)^2 + (B_2 - A_2)^2 + (B_3 - A_3)^2 + (B_4 - A_4)^2} = \sqrt{\sum_{i = 1}^{4}{(B_i - A_i)^2}}$$
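To compute every pairwise gene-to-gene distance at once, `scipy.spatial.distance.pdist` can be used. A minimal sketch with made-up expression values (rows are genes, columns are the four samples):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# made-up expression values: one row per gene, one column per sample
genes = np.array([
    [2.0, 4.0, 1.0, 3.0],   # GeneA
    [3.0, 2.0, 5.0, 1.0],   # GeneB
    [2.5, 3.5, 1.5, 2.8],   # GeneC
])

# pdist returns the condensed vector of pairwise Euclidean distances;
# squareform turns it into a symmetric gene-by-gene matrix
dist_matrix = squareform(pdist(genes, metric="euclidean"))
print(np.round(dist_matrix, 3))
# d(GeneA, GeneB) = sqrt(1 + 4 + 16 + 4) = 5.0
```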

Loading the data

We will subset the data to include only the columns we are interested in. Then we will separate the data into test and training sets.

In [1]:
# install scipy and plotly if you haven't
# !pip install scipy
# !pip install plotly
In [56]:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import auc
from sklearn.metrics import roc_curve
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import linkage
from matplotlib import pyplot as plt
import plotly.express as px
import seaborn as sns
import math
In [3]:
data = pd.read_csv("all.15k.patients.txt",sep = "\t")
data.head()
Out[3]:
dateofbirth maritalstatus race ageatdiagnosis alivestatus survivaltime grade nodesexam nodespos extent nodalstatus size pgr er
pid00001 28/01/1932 5 2 61 0 110 3 32 3 10 6 60 2 2
pid00002 07/05/1934 2 3 60 0 100 2 13 1 10 6 15 1 1
pid00003 14/04/1921 5 3 76 0 70 3 8 0 10 0 8 1 1
pid00004 08/11/1931 3 3 61 1 31 3 20 0 10 0 10 2 2
pid00005 08/01/1922 2 3 74 1 47 2 16 8 10 6 15 2 1
In [4]:
# to perform z transformation, use StandardScaler().fit_transform()
size_nodspos_scale = StandardScaler().fit_transform(data[["size", "nodespos"]])

# Substitute the original data with the transformed data
data[["size", "nodespos"]] = size_nodspos_scale
# notice column "size" and "nodespos" have changed
data.head()
Out[4]:
dateofbirth maritalstatus race ageatdiagnosis alivestatus survivaltime grade nodesexam nodespos extent nodalstatus size pgr er
pid00001 28/01/1932 5 2 61 0 110 3 32 0.312479 10 6 2.461073 2 2
pid00002 07/05/1934 2 3 60 0 100 2 13 -0.156849 10 6 -0.394226 1 1
pid00003 14/04/1921 5 3 76 0 70 3 8 -0.391513 10 0 -0.838383 1 1
pid00004 08/11/1931 3 3 61 1 31 3 20 -0.391513 10 0 -0.711481 2 2
pid00005 08/01/1922 2 3 74 1 47 2 16 1.485798 10 6 -0.394226 2 1
In [5]:
# subset columns
specific_data = data[["alivestatus","grade","nodespos","nodesexam","size","pgr","er"]]

# set alivestatus as categorical
specific_data = specific_data.astype({'alivestatus': 'category'})
In [6]:
# randomly choose 5000 rows as test set
data_test = specific_data.sample(5000)

# set difference
whole_index = specific_data.index
test_index = data_test.index
train_index = whole_index.difference(test_index)

# the training set is everything not in the test set
data_train = specific_data.loc[train_index,]
In [7]:
sns.scatterplot(data=specific_data, x="nodespos", y="size", hue="alivestatus")
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa5d4059250>

Clustering

Now that we can calculate the distance between genes, we can group genes that are close together. There are two main clustering methods: hierarchical and K-means. Hierarchical clustering is a bottom-up approach: we first group together the genes that are most similar and work our way up. K-means clustering is used when you already know how many clusters you need; the cluster centers are then estimated iteratively from the data.

1. Hierarchical clustering

Hierarchical clustering can be constructed in many ways, but the simplest is the UPGMA method, which is also used for constructing phylogenetic trees. The steps are:

  • Select the two genes that have the shortest distance. The height of the branch is that distance.
  • Group them into a single entity and recalculate its distance to the other genes. How this is done is called the linkage method. We will use the average method, which means we use the average distance between the two merged elements and each other element (see the example for details).
  • Repeat until all genes are grouped.
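The steps above can be sketched with SciPy's `average` linkage (i.e. UPGMA) on a toy distance matrix; the distances here are made up:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# made-up pairwise distances between four genes
D = np.array([[ 0.0,  2.0,  6.0, 10.0],
              [ 2.0,  0.0,  5.0,  9.0],
              [ 6.0,  5.0,  0.0,  4.0],
              [10.0,  9.0,  4.0,  0.0]])

# 'average' linkage: a merged cluster's distance to any other cluster
# is the average of its members' distances to that cluster
Z = linkage(squareform(D), method="average")
print(Z)
# each row is one merge: (cluster i, cluster j, merge height, new size);
# genes 0 and 1 merge first at height 2.0, genes 2 and 3 at 4.0,
# and the final merge happens at (6 + 10 + 5 + 9) / 4 = 7.5
```

The merge heights in `Z` are exactly the branch heights that `dendrogram` draws.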
In [8]:
# use sklearn.cluster.AgglomerativeClustering for prediction purpose
X = np.array(specific_data[["nodespos","size"]])
hclustering = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='complete')
hclustering.fit_predict(X)

# predicted labels:
print("The predicted labels:" , hclustering.labels_)
The predicted labels: [0 0 0 ... 0 0 0]
In [9]:
# drawing dendrogram with sklearn is difficult,
# so we use scipy.cluster.hierarchy.linkage and scipy.cluster.hierarchy.dendrogram for plotting purpose

# it's not practical to draw all 15,000 points here, so I drew 100 samples
linked = linkage(X[:100,], 'complete')

dendrogram(linked, orientation='top')
plt.show()

2. K-means clustering

For K-means clustering you must first decide on a $k$. In this case let’s pick 2. This means 2 centers, representing two different groups, will be randomly placed in our dataset, and then each point will be assigned to the group whose center it is closest to.

In [10]:
X = np.array(specific_data[["nodespos","size"]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
kmeans.labels_
Out[10]:
array([1, 0, 0, ..., 0, 1, 0], dtype=int32)
In [11]:
confusion_matrix(specific_data.alivestatus, kmeans.labels_)
Out[11]:
array([[11412,   833],
       [ 1830,   925]])
In [12]:
kmean_df = specific_data[["nodespos","size","alivestatus"]].copy()
kmean_df["kmeans"] = kmeans.labels_

# plot with seaborn
sns.scatterplot(data=kmean_df, x="nodespos", y="size", hue="kmeans", style = "alivestatus")
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa5d472cc90>

Let’s pretend we have a new patient who has the mean value for nodespos and the mean value for size, <0,0>. How can we determine which cluster it belongs to?

In [18]:
# our new patient
ndata = np.array([0,0])

# centroids coordinates
kmeans.cluster_centers_
Out[18]:
array([[-0.23498209, -0.23404083],
       [ 1.76998454,  1.76289456]])
In [34]:
# calculate the Euclidean distance from centroids to our new data (0,0)
dist_alive = math.sqrt(sum(np.square(kmeans.cluster_centers_[0,:] - ndata)))
dist_dead = math.sqrt(sum(np.square(kmeans.cluster_centers_[1,:] - ndata)))

# normalize the distance
norm_dist_alive = dist_alive/(dist_alive + dist_dead)
norm_dist_dead = dist_dead/(dist_alive + dist_dead)


# probability = 1 - normalized distance 
p_alive = 1 - norm_dist_alive
p_dead = 1 - norm_dist_dead

print("Probability of patient alive: ", p_alive)
print("Probability of patient dead: ", p_dead)
Probability of patient alive:  0.8828000000000022
Probability of patient dead:  0.11719999999999775
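Note that scikit-learn can also assign a new point to a cluster directly: `KMeans.predict` returns the index of the nearest centroid. A minimal sketch on made-up two-group data (not the patient table):

```python
import numpy as np
from sklearn.cluster import KMeans

# made-up 2-D data with two obvious groups, near (0, 0) and near (3, 3)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(3, 0.3, (50, 2))])

kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

# predict() gives the hard assignment: the label of the nearest centroid
new_patient = np.array([[0.0, 0.0]])
print(kmeans.predict(new_patient))
```

Unlike the normalized-distance calculation, `predict` gives only a hard label, not a degree of confidence.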

We can also add another dimension to our clustering, which may help our predictions.

In [38]:
fig = px.scatter_3d(specific_data, x='nodespos', y='size', z='grade',
              color='alivestatus')
fig.show()

K Nearest Neighbor

The strategy of K-Nearest Neighbor is to simply look at the training points that are closest to a test point and assign its classification accordingly. $k$ stands for the number of closest neighbors to look at. For example, if $k = 3$, then a test point looks at its $3$ closest neighbors: if $2/3$ are classified as $1$ and $1/3$ is classified as $0$, the classification of the test point is simply $1$.

So there is no model to train: we simply store the known points and use them to classify the rest.

You can guess that changing $k$ will have a large effect on the classification. The larger the $k$, the smoother the separation between the classes, whereas the smaller the $k$, the tighter the boundaries, making them more prone to over-fitting.

Ideally you want to try several $k$ values to see which gives the least error.
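A sketch of that search, looping over candidate $k$ values and scoring each on held-out data (synthetic stand-in data here, rather than the patient table):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# synthetic two-feature, two-class stand-in for the patient data
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# fit one classifier per candidate k and score it on the held-out set
for k in (1, 3, 10, 30, 100):
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_te, y_te)
    print(f"k={k:3d}  accuracy={acc:.3f}")
```

With the real data, you would score each candidate $k$ on the test set in the same way and keep the value with the smallest error.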

Running K-NN

In [44]:
X_test = np.array(data_test[["nodespos", "size"]])
y_test = np.array(data_test["alivestatus"])

X_train = np.array(data_train[["nodespos", "size"]])
y_train = np.array(data_train["alivestatus"])
In [51]:
data_knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
predicted = data_knn.predict(X_test)
predicted_proba = data_knn.predict_proba(X_test)
In [48]:
confusion_matrix(y_test, predicted)
Out[48]:
array([[3666,  413],
       [ 613,  308]])

Calculate the AUC

In [62]:
fpr, tpr, threshold = roc_curve(y_test, predicted_proba[:,1])
roc_auc = auc(fpr, tpr)
print("AUC of the model is: ", auc(fpr, tpr))


plt.figure()

plt.title('AUC plot')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate (1 - specificity)')
plt.show()
AUC of the model is:  0.6832422574884361
In [74]:
knn3_df = data_test[["nodespos","size","alivestatus"]].copy()
knn3_df["predict"] = predicted
sns.scatterplot(data=knn3_df, x="nodespos", y="size", hue="predict", style = "alivestatus")
Out[74]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa54ee067d0>

Changing the k value

In [71]:
data_knn_100 = KNeighborsClassifier(n_neighbors=100).fit(X_train, y_train)
predicted_100 = data_knn_100.predict(X_test)
predicted_proba_100 = data_knn_100.predict_proba(X_test)
In [72]:
confusion_matrix(y_test, predicted_100)
Out[72]:
array([[3932,  147],
       [ 708,  213]])

Calculate the AUC

In [73]:
fpr_100, tpr_100, threshold_100 = roc_curve(y_test, predicted_proba_100[:,1])
roc_auc_100 = auc(fpr_100, tpr_100)
print("AUC of the model is: ", auc(fpr_100, tpr_100))


plt.figure()

plt.title('AUC plot (k = 100)')
plt.plot(fpr_100, tpr_100, 'b', label = 'AUC = %0.2f' % roc_auc_100)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate (1 - specificity)')
plt.show()
AUC of the model is:  0.772510294112558
In [75]:
knn100_df = data_test[["nodespos","size","alivestatus"]].copy()
knn100_df["predict"] = predicted_100
sns.scatterplot(data=knn100_df, x="nodespos", y="size", hue="predict", style = "alivestatus")
Out[75]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa54ead3050>